Transcription Factor Binding Site Detection Algorithm Using Distance Metrics Based on a Position Frequency Matrix Concept
Authors
Abstract
Regulatory sequence detection is a fundamental challenge in computational biology. The transcription process in protein synthesis starts with the binding of a transcription factor (TF) to its binding site. These binding sites are short DNA segments called motifs, and different sites can bind the same factor. This variability in binding sequences, together with their low information content and low specificity, makes them difficult to detect with computational algorithms. This paper proposes a novel algorithm that detects transcription factor binding sites (TFBSs) across the entire genomic sequence and allows the discovery of new motif sequences. It does so by using distance metrics, based on a position frequency matrix (PFM) concept, that quantify the similarity between the set of conserved sequences belonging to a particular TF and the entire DNA sequence under study. In this context the PFM can be thought of as a consensus sequence, since it provides a representative measure of the set of binding sites belonging to a particular TF. The algorithm first quantifies the correlation between the PFM and each known binding site of a given TF. The same procedure is then applied to the genome sequence under study. The resulting distance metrics are used to discover new potential TFBSs based on their similarity to the set of binding sites investigated. The analysis is applied to Escherichia coli (E. coli) bacterial genomes. Simulation results verify the correctness and the biological relevance of the proposed algorithm.
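The abstract does not say which distance metric is used, so the following Python sketch only illustrates the general idea under stated assumptions: the known sites are aligned and of equal length, the distance is the Euclidean distance between a one-hot encoded segment and the PFM column frequencies, and the detection threshold is the largest distance observed among the known sites. The function names (`build_pfm`, `distance_to_pfm`, `scan`) and the toy sequences are hypothetical and are not taken from the paper.

```python
import math

BASES = "ACGT"

def build_pfm(sites):
    """Count each base at each position of aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = [{b: 0 for b in BASES} for _ in range(length)]
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[i][base] += 1
    return pfm

def distance_to_pfm(segment, pfm, n_sites):
    """Assumed metric: Euclidean distance between the one-hot encoding
    of the segment and the per-position base frequencies of the PFM."""
    total = 0.0
    for i, base in enumerate(segment.upper()):
        for b in BASES:
            freq = pfm[i][b] / n_sites
            indicator = 1.0 if b == base else 0.0
            total += (indicator - freq) ** 2
    return math.sqrt(total)

def scan(sequence, sites):
    """Slide a window over the sequence and report segments that are at
    least as close to the PFM as the most distant known site (assumed rule)."""
    pfm = build_pfm(sites)
    n_sites = len(sites)
    width = len(sites[0])
    threshold = max(distance_to_pfm(s, pfm, n_sites) for s in sites)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        d = distance_to_pfm(window, pfm, n_sites)
        if d <= threshold:
            hits.append((start, window, d))
    return hits

# Toy data for illustration only (not from the paper).
sites = ["TTGACA", "TTGACA", "TTGACT", "TAGACA"]
hits = scan("GGGTTGACAGGG", sites)
```

Deriving the threshold from the known sites mirrors the two-stage description in the abstract (score the known sites first, then the genome), but a real analysis would likely need a more careful cutoff and a background model.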
Similar resources
A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites
Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this pa...
Modeling within-motif dependence for transcription factor binding site predictions
MOTIVATION The position-specific weight matrix (PWM) model, which assumes that each position in the DNA site contributes independently to the overall protein-DNA interaction, has been the primary means to describe transcription factor binding site motifs. Recent biological experiments, however, suggest that there exists interdependence among positions in the binding sites. In order to exploit t...
Similarity of position frequency matrices for transcription factor binding sites
MOTIVATION Transcription-factor binding sites (TFBS) in promoter sequences of higher eukaryotes are commonly modeled using position frequency matrices (PFM). The ability to compare PFMs representing binding sites is especially important for de novo sequence motif discovery, where it is desirable to compare putative matrices to one another and to known matrices. RESULTS We describe a PFM simil...
An Empirical Prior Improves Accuracy for Bayesian Estimation of Transcription Factor Binding Site Frequencies within Gene Promoters
A Bayesian method for sampling from the distribution of matches to a precompiled transcription factor binding site (TFBS) sequence pattern (conditioned on an observed nucleotide sequence and the sequence pattern) is described. The method takes a position frequency matrix as input for a set of representative binding sites for a transcription factor and two sets of noncoding, 5' regulatory sequen...
ANFIS-based Fuzzy Systems for Searching DNA-Protein Binding Sites
Transcriptional regulation mainly controls how genes are expressed and how cells behave based on the transcription factor (TF) proteins that bind upstream of the transcription start sites (TSSs) of genes. These TF DNA binding sites (TFBSs) are usually short (5-15 base pairs) and degenerate (some positions can have multiple possible alternatives). Traditionally, computational methods scan DNA se...